We fit a linear regression model with lprice (the log of price) as the dependent variable and points and cherry as the independent variables; in other words, we predict lprice from points and cherry.
We use the predict function to generate predictions from the linear model m1 using the wine dataset.
Calculate the residuals by subtracting the predicted values from the actual log prices.
Compute the Root Mean Squared Error (RMSE) to evaluate the model’s performance.
RMSE: The RMSE measures how far the model's predictions fall from the actual values, in the same units as the outcome (here, log price). A lower RMSE indicates a better fit, meaning the predictions are closer to the actual values. In this case the RMSE is 0.4687657, which gives us a baseline to compare against the interaction model below.
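The steps above can be sketched in R as follows; this assumes the vignette's `wine` data frame, with columns `lprice` (log price), `points`, and a `cherry` indicator, is already loaded:

```r
# Fit the first model: log price on points and the cherry indicator.
# Assumes the `wine` data frame described above.
m1 <- lm(lprice ~ points + cherry, data = wine)

# Generate predictions on the same data.
pred1 <- predict(m1, newdata = wine)

# Residuals: actual log price minus predicted log price.
res1 <- wine$lprice - pred1

# Root Mean Squared Error.
sqrt(mean(res1^2))  # reported above as 0.4687657
```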
We fit a linear regression model where lprice is the dependent variable and points, cherry, and their interaction are the independent variables. In R, the formula lprice ~ points * cherry expands to the two main effects plus the interaction term points:cherry.
We use the predict function to generate predictions from the linear model m2 using the wine dataset.
Calculate the residuals by subtracting the predicted values from the actual log prices.
Compute the Root Mean Squared Error (RMSE) to evaluate the model’s performance.
RMSE: The RMSE again measures how close the predictions are to the actual log prices, with lower values indicating a better fit. Here the RMSE is 0.4685223, only marginally lower than m1's 0.4687657, so adding the interaction improves the in-sample fit, but only slightly.
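The interaction model follows the same pipeline; again this sketch assumes the `wine` data frame from the vignette:

```r
# Fit the second model: points * cherry expands to
# points + cherry + points:cherry (main effects plus interaction).
m2 <- lm(lprice ~ points * cherry, data = wine)

# Predict, compute residuals, and evaluate RMSE as before.
pred2 <- predict(m2, newdata = wine)
res2 <- wine$lprice - pred2
sqrt(mean(res2^2))  # reported above as 0.4685223
```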
The Interaction Variable
summary(m2)
Call:
lm(formula = lprice ~ points * cherry, data = wine)
Residuals:
Min 1Q Median 3Q Max
-1.6432 -0.3332 -0.0151 0.2924 3.9645
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.659620 0.102252 -55.350 < 2e-16 ***
points 0.102225 0.001149 88.981 < 2e-16 ***
cherry -1.014896 0.215812 -4.703 2.58e-06 ***
points:cherry 0.012663 0.002409 5.256 1.48e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4686 on 26580 degrees of freedom
Multiple R-squared: 0.3062, Adjusted R-squared: 0.3061
F-statistic: 3910 on 3 and 26580 DF, p-value: < 2.2e-16
The coefficient of the interaction term (points:cherry) is positive, indicating that the effect of points on the log of price is larger when the wine description contains the word “cherry”: each additional point raises log price by about 0.1022 for other wines, but by about 0.1149 (0.102225 + 0.012663) for “cherry” wines. Note that the main effect of cherry is negative (-1.014896), so the net effect of the “cherry” description is negative for low-scoring wines and turns positive only above roughly 80 points (1.014896 / 0.012663 ≈ 80).
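To see where the net effect of the “cherry” description changes sign, we can plug the fitted coefficients from the summary above into a small helper function (the helper itself is just an illustration, not part of the original code):

```r
# Net contribution of the cherry description at a given points score:
# cherry main effect plus the interaction coefficient times points.
cherry_effect <- function(p) -1.014896 + 0.012663 * p

cherry_effect(80)  # roughly zero: about -0.0019
cherry_effect(95)  # clearly positive: about 0.188
```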
Applications
In which province (Oregon, California, or New York) does the ‘cherry’ feature affect price most?
We create three subsets of the data, one for each province.
We fit a linear regression model on each of the three subsets.
We extract the coefficients from the summary function.
The larger the ‘cherry’ coefficient, the stronger the association between the ‘cherry’ description and log price in that province.
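A sketch of those steps in R; the `province` column name, the exact province labels, and the per-province model specification are assumptions, since the original code is not shown here:

```r
# For each province: subset the data, fit a per-province model,
# and pull out the coefficient on the cherry indicator.
cherry_coef <- function(prov) {
  sub <- subset(wine, province == prov)
  coef(lm(lprice ~ points + cherry, data = sub))["cherry"]
}

sapply(c("Oregon", "California", "New York"), cherry_coef)
```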
The coefficients for the ‘cherry’ feature in each province are as follows:
Oregon: 0.2203241
California: 0.0956657
New York: 0.1517924
The ‘cherry’ feature has the highest positive effect on the log of price in Oregon. This suggests that in this province, wines described with “cherry” tend to have a higher price compared to those without the “cherry” description.
Scenarios
On Accuracy
Imagine a model to distinguish New York wines from those in California and Oregon. After a few days of work, you take some measurements and note: “I’ve achieved 91% accuracy on my model!”
The baseline accuracy (always predicting the most likely outcome, i.e., a wine that is not from New York) is a little under 72%, well below the 91% our model achieves. So it is reasonable to be impressed by the model: it does substantially better than always predicting the most likely outcome.
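One way to check that baseline, assuming the `wine` data frame has a `province` column (the column name and label spelling are assumptions):

```r
# Majority-class baseline: always predict "not New York".
# The larger class share is the accuracy of that constant prediction.
is_ny <- wine$province == "New York"
max(mean(is_ny), 1 - mean(is_ny))  # a little under 0.72 per the text
```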
On Ethics
Why is understanding this vignette important to use machine learning in an ethical manner?
It’s important to understand the context around our models. High accuracy doesn’t mean much if it isn’t actually much better than the baseline accuracy. We need to be able to evaluate the quality of our models rather than just looking at metrics like RMSE and p-values and assuming our models are doing a great job, without understanding the context of the data we are using. In theory this will also lead us to create more interesting and more effective models.
Ignorance is no excuse
Imagine you are working on a model to predict the likelihood that an individual loses their job as the result of the changing federal policy under new presidential administrations. You have a very large dataset with many hundreds of features, but you are worried that including indicators like age, income or gender might pose some ethical problems. When you discuss these concerns with your boss, she tells you to simply drop those features from the model. Does this solve the ethical issue? Why or why not?
It doesn’t solve the ethical issues, for a couple of reasons. First, there could be proxy variables in the dataset: even if we remove things like age, income, or gender, we might still end up discriminating on those attributes through correlated variables, such as zip code. Second, even if those variables have no proxies, the model can still be discriminatory, because it learns from historical data that reflects past human biases. Instead, we might actually want to include those variables so we can see how much they affect the model and adjust it to be fairer than if we had simply removed them.